Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees

Matteo Ceccarello

University of Padova

Joint work with Johann Gamper (U. Bolzano)

Time series and subsequences

Fix a subsequence length \(w\)

  • The Euclidean distance measures the similarity between subsequences
  • Subsequences can be seen as points in a \(w\)-dimensional space
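
This view of subsequences as points in a \(w\)-dimensional space can be made concrete in a few lines of NumPy (a minimal illustration; the variable names and toy data are ours):

```python
import numpy as np

def subsequences(ts, w):
    """View each length-w subsequence of ts as a point in R^w."""
    # sliding_window_view builds an (n - w + 1, w) array without copying
    return np.lib.stride_tricks.sliding_window_view(ts, w)

# Toy example: a noisy sine wave
rng = np.random.default_rng(0)
ts = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * rng.standard_normal(2000)

w = 100
subs = subsequences(ts, w)
# Euclidean distance between the subsequences starting at indices 0 and 200
d = np.linalg.norm(subs[0] - subs[200])
```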

This time series is from an experiment recording the behavior of a Circulifer tenellus insect. In particular, the subsequences below are associated with a specific feeding behavior.

Top-\(k\) motifs

Definition 1 Given a time series \(T\) and a length \(w\), the top-\(k\) motifs are the \(k\) pairs of subsequences of \(T\) of length \(w\) with the smallest distance, such that the two subsequences in each pair do not overlap.
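
A brute-force reference implementation makes the definition concrete (a sketch, not Attimo itself; following common practice, and the appendix discussion, we also skip pairs that overlap higher-ranked motifs to avoid trivial shifted matches):

```python
import numpy as np

def top_k_motifs_bruteforce(ts, w, k):
    """Naive O(n^2) reference for the top-k motifs definition. Besides
    requiring the two subsequences of a pair not to overlap, we also skip
    pairs overlapping already-selected motifs (trivial shifted matches)."""
    subs = np.lib.stride_tricks.sliding_window_view(ts, w)
    n = len(subs)
    # All candidate pairs at index distance >= w, i.e. non-overlapping
    pairs = sorted(
        (np.linalg.norm(subs[i] - subs[j]), i, j)
        for i in range(n) for j in range(i + w, n)
    )
    motifs, taken = [], []
    for d, i, j in pairs:
        if all(abs(i - t) >= w and abs(j - t) >= w for t in taken):
            motifs.append((d, i, j))
            taken += [i, j]
            if len(motifs) == k:
                break
    return motifs

# Plant an exact repetition in random data and recover it as the top motif
rng = np.random.default_rng(1)
ts = rng.standard_normal(300)
ts[200:220] = ts[50:70]
motifs = top_k_motifs_bruteforce(ts, w=20, k=2)
```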

Our contribution

  • We propose Attimo, a randomized algorithm for discovering the exact top-\(k\) motifs
  • Attimo auto-tunes its parameters to the data distribution
  • Attimo allows the user to control the success probability, i.e., its recall
  • Attimo can handle a time series of one billion values in just half an hour

State of the art: Matrix Profile

Algorithm

  • For each subsequence, find the nearest neighbor
  • Find the subsequence with the closest nearest neighbor
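
The two steps above can be sketched naively (this only captures the semantics; real Matrix Profile implementations such as SCAMP exploit the shared structure of consecutive subsequences and run on GPUs):

```python
import numpy as np

def matrix_profile_naive(ts, w):
    """O(n^2) sketch of the matrix profile: for every subsequence, the
    distance to (and index of) its nearest non-overlapping neighbor."""
    subs = np.lib.stride_tricks.sliding_window_view(ts, w)
    n = len(subs)
    mp = np.full(n, np.inf)
    mp_idx = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(n):
            if abs(i - j) >= w:  # exclude trivial (overlapping) matches
                d = np.linalg.norm(subs[i] - subs[j])
                if d < mp[i]:
                    mp[i], mp_idx[i] = d, j
    return mp, mp_idx

# Planted-motif toy data: the smallest profile value pinpoints the motif,
# together with the corresponding nearest-neighbor index.
rng = np.random.default_rng(0)
ts = rng.standard_normal(120)
ts[80:90] = ts[20:30]
mp, mp_idx = matrix_profile_naive(ts, w=10)
```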

Pros

  • Leverages the fact that consecutive subsequences share a lot of structure
  • Parallelizes across many GPUs
  • Finds motifs in a time series of 1 billion values in one day, using 5 GPUs

Cons

  • It is an \(\Omega(n^2)\) algorithm for a time series of length \(n\)

Goal

  • Find the top motifs without computing all pairwise distances

Challenges

  • Subsequences can be seen as vectors
  • These vectors are high dimensional
  • Curse of dimensionality: indices such as R-trees degrade to linear scans

Locality Sensitive Hashing

Subsequences of length \(w\) are points in \(\mathbb{R}^w\), with Euclidean distance.

  • choose \(r \in \mathbb{R}^+\)
  • sample \(\vec{a} \sim \mathcal{N}(0,1)^w\), \(b \sim \mathcal{U}(0,r)\)

\[ h(\vec{x}) = \left\lfloor\frac{\vec{a} \cdot \vec{x} + b}{r}\right\rfloor \]

The key point is that we only compute the distance of subsequences falling
into the same bucket.

To lower the collision probability we concatenate \(\tau\) hash functions \[ \hat{h}(\vec{x}) = \langle h_1(\vec{x}), \dots, h_\tau(\vec{x}) \rangle \] which improves the precision of the index.

And to increase the recall of the index we repeat \(L\) times.
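
A minimal sketch of this scheme, with bucket width \(r\), \(\tau\) concatenations, and \(L\) repetitions (function names are ours, not Attimo's API):

```python
import numpy as np
from collections import defaultdict

def build_lsh_index(points, r, tau, L, seed=0):
    """Euclidean LSH: each of the L repetitions hashes every point with
    tau concatenated functions h(x) = floor((a.x + b) / r)."""
    rng = np.random.default_rng(seed)
    w = points.shape[1]
    tables = []
    for _ in range(L):
        A = rng.standard_normal((tau, w))  # one a ~ N(0,1)^w per h_i
        b = rng.uniform(0, r, size=tau)    # one b ~ U(0, r) per h_i
        keys = np.floor((points @ A.T + b) / r).astype(int)
        buckets = defaultdict(list)
        for idx, key in enumerate(map(tuple, keys)):
            buckets[key].append(idx)
        tables.append(buckets)
    return tables

def candidate_pairs(tables):
    """Only points sharing a bucket in some repetition become candidates."""
    cands = set()
    for buckets in tables:
        for bucket in buckets.values():
            for i in bucket:
                for j in bucket:
                    if i < j:
                        cands.add((i, j))
    return cands

# Two nearby points and one far-away point: the close pair is a candidate
pts = np.vstack([np.zeros(16), np.full(16, 0.01), np.full(16, 100.0)])
tables = build_lsh_index(pts, r=4.0, tau=2, L=10)
cands = candidate_pairs(tables)
```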

Computing the success probability

  • Consider a distance \(d\)
  • Assume we have performed \(L\) LSH repetitions with \(\tau\) concatenations
  • We can compute the probability of having seen, at least once, every pair at distance \(d\)
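
For the hash family above, the collision probability \(p(d)\) of one hash function has a closed form (Datar et al.'s \(p\)-stable LSH), and a pair at distance \(d\) is seen at least once with probability \(1-(1-p(d)^\tau)^L\). A sketch of this computation:

```python
import math

def collision_prob(d, r):
    """Collision probability of a single Euclidean LSH function
    (Datar et al.) for two points at distance d, bucket width r."""
    t = r / d
    # standard normal CDF at -t, via the error function
    phi = 0.5 * (1 + math.erf(-t / math.sqrt(2)))
    return (1 - 2 * phi
            - (2 / (math.sqrt(2 * math.pi) * t)) * (1 - math.exp(-t * t / 2)))

def success_prob(d, r, tau, L):
    """Probability that a pair at distance d collides in at least one of
    L repetitions, each using tau concatenated hash functions."""
    p = collision_prob(d, r)
    return 1 - (1 - p ** tau) ** L
```

As expected, farther pairs collide less often, and adding repetitions raises the probability of seeing any fixed pair.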

A simple algorithm

Auto tuning parameters

  • How do we set \(\tau\)?
  • The wrong value might make us do too many repetitions
  • We adopt an approach that auto tunes the parameters based on the data at hand

  • \(L_{max}\) maximum number of repetitions,
  • \(\tau_{max}\) maximum number of concatenations (e.g. 4),
  • \(\delta\) probability parameter: succeed with probability \(\ge 1-\delta\).

Repetition 1

Repetition 2

Guarantees

Theorem 1 Our algorithm finds the exact top-\(k\) motifs each with probability \(\ge 1-\delta\).

  • For \(\delta=0.1\), each true motif is found with 90% probability
  • This means that we can control the recall of the algorithm
  • Tradeoff: the higher the desired recall, the slower the discovery process

Complexity

Theorem 2 The LSH index construction takes time \[ O(\tau_{max} \cdot \sqrt{L_{max}}\; n\log n) \]

Theorem 3 Let \(d(m_k)\) be the distance of the \(k\)-th motif, and \(i'\le \tau_{max}\), \(j' \le L_{max}\) be the parameters used by the optimal LSH algorithm. Then, the algorithm considers \[ O\left( j'\sum_{m\in T^w\times T^w} p(d(m))^{i'} + (L_{max}-j')\sum_{m\in T^w\times T^w} p(d(m))^{i'+1} \right) \] pairs in expectation.

Experimental results

Find the top-10 motifs.

dataset    \(n\) (millions)      RC
astro             1            8.63
GAP               2            9.17
freezer           7            7.95
ECG               7          109.06
HumanY           26          581.03
Whales          308           21.66
Seismic        1000          274.44

Relative Contrast measures difficulty: higher is easier. \[ RC = \frac{d_{avg}}{d_{motif}} \]
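
Given the motif distance, \(d_{avg}\) can be estimated by sampling random non-overlapping pairs; a sketch (the function name and sampling scheme are ours):

```python
import numpy as np

def relative_contrast(ts, w, d_motif, samples=10000, seed=0):
    """Estimate RC = d_avg / d_motif, approximating d_avg by sampling
    random pairs of non-overlapping subsequences."""
    rng = np.random.default_rng(seed)
    subs = np.lib.stride_tricks.sliding_window_view(ts, w)
    n = len(subs)
    total, count = 0.0, 0
    while count < samples:
        i, j = rng.integers(0, n, size=2)
        if abs(i - j) >= w:  # keep only non-overlapping pairs
            total += np.linalg.norm(subs[i] - subs[j])
            count += 1
    return (total / count) / d_motif

# On pure noise with a (hypothetical) motif distance of 1, RC is large:
rng = np.random.default_rng(0)
ts = rng.standard_normal(2000)
rc = relative_contrast(ts, w=50, d_motif=1.0, samples=500)
```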

Scalability

Synthetic data, planted motif.

Practical takeaways

  • For shorter time series, or if the relative contrast is very small, use the Matrix Profile.

  • For time series of a few million values and above, with a not-too-small relative contrast, try Attimo.

  • Attimo gives you control over the recall, and adapts to the data distribution

References

  • Matteo Ceccarello, Johann Gamper: Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees. Proc. VLDB Endow. 15(13): 3841-3853 (2022)

https://github.com/Cecca/attimo



import pyattimo
ts = pyattimo.load_dataset("ecg", prefix=1000000)
motifs = pyattimo.MotifsIterator(ts, w=1000)
m = next(motifs)

matteo.ceccarello@unipd.it

Appendix

Influence of \(L_{max}\)

Running for top-10 motifs, for different number of repetitions.

Difficult datasets

Data from LIGO:

  • Length 100k, window 1k
  • Top motif distance \(\approx 40\)
  • Average distance \(\approx 44\)
  • Relative contrast \(\approx 1.1\)
  • Attimo: 6 hours
  • SCAMP: \(\approx 1\) second

freezer and the 7th motif

7M points, window 5000

  • In black are the distances of the top-10 motifs extracted from the matrix profile.
  • In red the distance of a pair of subsequences neither of which is the nearest neighbor of the other, and not overlapping with higher-ranked motifs.
  • The matrix profile holds the distance and index of the 1-nearest neighbor of each subsequence, but the top-\(k\) motifs would require the \(k\) nearest neighbors of each subsequence to be maintained.

Top-k and k-NN

  • Solid lines are nearest neighbor distances
  • The dashed line is the distance of the top-2 pair in the definition we are using
  • The Matrix Profile contains information about 1-nearest neighbors (solid black lines)